5.6 EMP_identify_assay
Traditional project data often have a large number of sparse features (e.g., microbial OTU/ASV annotation tables with a large number of very low abundance species), which may be caused by sample contamination, library errors, sequencing bias, or annotation failures. To simplify the calculation and reduce the interference of these sparse features, this module provides two methods to filter the raw data.
5.6.1 Core microbial data
Microbial species annotation results usually contain many "rare species" (relatively low abundance or low frequency across samples). These interfere strongly with the identification of species that differ between groups, especially when screening for key species: machine learning algorithms (e.g., random forest, LEfSe) can easily flag such "rare species" as differential between groups. It is therefore necessary to filter "rare species" by uniform criteria before formal analysis. The module EMP_identify_assay introduces two important parameters for this: min (minimum relative abundance) and min_ratio (minimum ratio). First, any abundance below the specified minimum relative abundance is converted to 0. Then, if in any group the number of samples that reach the minimum relative abundance is greater than the minimum ratio multiplied by the number of samples in that group, the species is considered a core species; all remaining species are classified as rare species and filtered out.
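The filtering rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the internal code of EMP_identify_assay: the helper name identify_core_sketch, the toy count matrix, and the taxon and sample names are all hypothetical, features are assumed to be rows and samples columns, and every sample is assumed to have a non-zero total.

```r
# Hypothetical helper illustrating the core-species rule; not part of the package.
identify_core_sketch <- function(counts, group, min = 0.01, min_ratio = 0.7) {
  # Convert each sample (column) to relative abundance.
  rel <- sweep(counts, 2, colSums(counts), "/")
  # Abundances below `min` count as absent (equivalent to setting them to 0).
  present <- rel >= min
  core <- rep(FALSE, nrow(counts))
  for (g in unique(group)) {
    in_group <- group == g
    # Fraction of samples in this group in which the feature is present.
    frac <- rowMeans(present[, in_group, drop = FALSE])
    # "Greater than" follows the wording of the text; the exact handling of
    # the boundary case is an assumption.
    core <- core | (frac > min_ratio)
  }
  # Return the original (absolute) counts of the core features only.
  counts[core, , drop = FALSE]
}

# Toy data: 3 taxa x 6 samples, two groups of three samples each.
counts <- matrix(
  c(50, 60, 55, 40, 45, 50,   # abundant everywhere
     0,  0,  1, 30, 35, 40,   # abundant in group B only
     1,  0,  0,  0,  1,  0),  # sporadic, low abundance
  nrow = 3, byrow = TRUE,
  dimnames = list(c("taxA", "taxB", "taxC"), paste0("S", 1:6))
)
group <- c("A", "A", "A", "B", "B", "B")

core_counts <- identify_core_sketch(counts, group, min = 0.01, min_ratio = 0.7)
rownames(core_counts)
```

In this toy example taxB is retained even though it is nearly absent from group A, because meeting the criterion in a single group is sufficient; taxC is dropped because it appears in only one of three samples in each group.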
🏷️Example:
Use the module EMP_assay_extract to extract the taxonomy assay. Then use the module EMP_identify_assay to filter for core species: the parameter estimate_group specifies Group as the grouping information, the parameter min sets the minimum relative abundance to 0.01, and the parameter min_ratio sets the minimum ratio to 0.7.
When the input microbial species annotation data are absolute abundances, the module EMP_identify_assay automatically converts them to relative abundances during computation so that core species can be filtered and identified, and then outputs the corresponding absolute abundance data for the features that pass the filter.
MAE |>
  EMP_assay_extract('taxonomy') |>
  EMP_identify_assay(estimate_group = 'Group', method = 'default', min = 0.01, min_ratio = 0.7)
5.6.2 Core genomic data
For genomic data, the edgeR package provides a filtering method based on minimum relative abundance, which can be invoked through the module EMP_identify_assay (with method = 'edgeR') to filter genomic/transcriptomic data.
🏷️Example:
MAE |>
  EMP_assay_extract('geno_ec') |>
  EMP_identify_assay(method = 'edgeR', min = 10, min_ratio = 0.7, estimate_group = 'Group')
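The values min = 10 and min_ratio = 0.7 coincide with the defaults of edgeR's filterByExpr (min.count = 10, min.prop = 0.7). As a rough sketch, assuming (this tutorial does not confirm it) that method = 'edgeR' relies on that function, a comparable filter can be applied directly to a count matrix. The toy matrix and gene names are invented for illustration, and the edgeR package must be installed.

```r
library(edgeR)

# Toy count matrix: 3 genes x 6 samples (hypothetical data).
counts <- matrix(
  c(500, 520, 480, 510, 490, 505,   # expressed in every sample
      0,   0,   0,   0,   0,   0,   # never detected
      0,   1,   0, 200, 220, 210),  # expressed in group B only
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("S", 1:6))
)
group <- factor(c("A", "A", "A", "B", "B", "B"))

# filterByExpr returns one logical per gene: TRUE means "keep".
keep <- filterByExpr(counts, group = group, min.count = 10, min.prop = 0.7)

# Subset the original counts to the retained genes.
filtered <- counts[keep, , drop = FALSE]
rownames(filtered)
```

Note that filterByExpr compares counts-per-million values against a cutoff derived from min.count and the median library size, so the threshold is expressed as a count but applied on a library-size-adjusted scale.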